Validation study on definition of cause of death in Japanese claims data

Identifying the cause of death is important for the study of end-of-life patients using claims data in Japan. However, the validity of how cause of death is identified using claims data remains unknown. Therefore, this study aimed to verify the validity of the method used to identify the cause of death based on Japanese claims data. Our study population included patients who died at two institutions between January 1, 2018 and December 31, 2019. Claims data consisted of medical data and Diagnosis Procedure Combination (DPC) data, and five definitions developed from disease classification in each dataset were compared with death certificates. Nine causes of death, including cancer, were included in the study. The definition with the highest positive predictive values (PPVs) and sensitivities in this study was the combination of "main disease" in both medical and DPC data. For cancer, these definitions had PPVs and sensitivities of > 90%. For heart disease, these definitions had PPVs of > 50% and sensitivities of > 70%. For cerebrovascular disease, these definitions had PPVs of > 80% and sensitivities of > 70%. For other causes of death, PPVs and sensitivities were < 50% for most definitions. Based on these results, we recommend definitions with a combination of "main disease" in both medical and DPC data for cancer and cerebrovascular disease. However, a clear argument cannot be made for other causes of death because of the small sample size. Therefore, the results of this study can be used with confidence for cancer and cerebrovascular disease but should be used with caution for other causes of death.


Introduction
In recent years, the evaluation of end-of-life care has been widely conducted using claims data [1][2][3][4][5]. Claims data are routinely collected for billing purposes and offer a large longitudinal dataset to researchers [6]. However, for other uses, verifying the validity of the information recorded in the claims data is essential. Claims data are recorded for the purpose of service reimbursement rather than for study purposes and contain little information on the background of patients, examination results, disease severity, and diagnoses [6,7]. Therefore, researchers must assess the validity and accuracy of key variables such as diagnosis [8]. In the US, Europe, and Asia-Pacific region, claims data have been validated against medical records or registry data in order to provide some confidence in the use of these data for study purposes [9,10].
Claims data from the US and Europe are linked at an individual level with data on the use of home care and nursing homes, hospital data, and causes of death [1,2][11][12][13]. In contrast, because the National Database (NDB), which stores claims data for all Japanese citizens, is restricted from being linked to external databases, we cannot ascertain the cause of death from the NDB. Several studies have evaluated the validity of Japanese claims data, including information related to diagnosis [14][15][16][17], procedure [14,18], prescription [19], and discharge [20,21].
However, the validity of identifying the cause of death in Japanese medical claims data remains unknown. A previous study reviewed the validation of claims data in the Asia-Pacific region [10] and reported that two of forty-three studies evaluated death information. Mealing et al. [22] reported a high sensitivity (92%-99%) and specificity (90.3%-97.9%) of death information by cancer type against the actual death information of a Department of Veterans' database for patients with cancer in Australia. Sakai et al. [21] reported a moderate to high sensitivity (47%-94%), high specificity (98%-99%), and high PPV (≥95%) for claims-based definitions of death. These values were obtained against the reason for loss of insured status ("died") in enrollment files, for inpatients and outpatients among non-dependent persons aged 65-74 years in Japanese workplace health insurance. Additionally, Fujiwara et al. [16] reported a moderate to high sensitivity (75%-100%) and high specificity and PPV (100%) for claims-based definitions of death. These values were obtained against a chart review based on electronic medical records of inpatients with cancer in a single hospital in Japan. Although several studies [16,20,21] have evaluated death information in the claims data against chart review and enrollment files, there is limited information regarding the method used to identify the cause of death and the sensitivity, specificity, and PPV of the cause of death in claims data against the death certificate. The accuracy of the method for identifying the cause of death is important for identifying patients with diseases of interest in end-of-life care. Therefore, this study aimed to verify the validity of the method used to identify the cause of death based on Japanese claims data.

Study design
We conducted this two-site cross-sectional study to validate definitions of the cause of death using information recorded in claims data against the death certificate (the latter serving as the gold standard). This study was reported according to the Standards for Reporting Diagnostic Accuracy (STARD) 2015 statement (S1 Table) [23] and approved by the Institutional Review Board of Tohoku University (Approval No. 2020-1-683; approval date: April 24, 2020).

Study patients
We included consecutive inpatients aged ≥20 years who died at Tohoku University Hospital or Nagoya University Hospital between January 1, 2018 and December 31, 2019. We obtained death certificates and claims data from each patient's electronic medical record that was stored by the Center of medical information technology in each university hospital. We linked the death certificate and claims data by a common ID, which is contained in these datasets, using the MERGE statement of DATA step statements (S1 Fig). Exclusion criteria were as follows: 1) absence of a cause of death on the death certificate, 2) death from a disease that could not be the cause of death, 3) death from a cause other than natural causes, and 4) claims data not obtained from the Center of medical information technology in each university hospital.
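The linkage step can be illustrated with a minimal sketch. The study itself used the SAS MERGE statement; the Python below is only an analogous illustration, and the field names (`patient_id`, `cause_text`, `data_type`, `claim_month`) and all records are hypothetical, not the hospitals' actual schema.

```python
# Illustrative sketch only: the study linked records with the SAS MERGE statement.
# All field names and records below are hypothetical.
death_certificates = [
    {"patient_id": "A001", "cause_text": "lung cancer"},
    {"patient_id": "A002", "cause_text": "cerebral infarction"},
    {"patient_id": "A003", "cause_text": "pneumonia"},
]
claims = [
    {"patient_id": "A001", "data_type": "medical", "claim_month": "2018-01"},
    {"patient_id": "A001", "data_type": "DPC", "claim_month": "2018-01"},
    {"patient_id": "A002", "data_type": "DPC", "claim_month": "2019-07"},
]


def link_by_id(certificates, claim_rows):
    """Attach all claims rows to each death certificate via the common ID."""
    claims_by_id = {}
    for row in claim_rows:
        claims_by_id.setdefault(row["patient_id"], []).append(row)

    linked, unlinked = [], []
    for cert in certificates:
        rows = claims_by_id.get(cert["patient_id"])
        if rows:
            linked.append({"certificate": cert, "claims": rows})
        else:
            # No claims data obtained: corresponds to exclusion criterion 4.
            unlinked.append(cert["patient_id"])
    return linked, unlinked


linked, unlinked = link_by_id(death_certificates, claims)
print(len(linked), unlinked)
```

In this sketch a patient may carry several claims rows (medical and DPC), so the link is one-to-many from the death certificate to the claims.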

Definitions of cause of death based on claims data
Japanese claims data are issued once a month, and they record treatment, prescriptions given for diseases, and discharge information [24,25]. Japanese claims data are classified into medical and Diagnosis Procedure Combination database (DPC) data [17,26]. Medical data are claims data issued by most medical institutions and are based on a system whereby points are added after each treatment. DPC data are claims data issued by acute care hospitals and include the cost of each treatment within the daily hospitalization costs set by the Ministry of Health, Labour, and Welfare.
We used two types of disease information from the medical data, in which diseases are classified into two categories: "disease" or "main disease." "Disease" refers to all diseases recorded. "Main disease" refers to the main flagged disease. In contrast, in the DPC data, disease information is classified into "greatest resource-consuming disease," "main disease," "trigger-for-hospitalization disease," and four other categories [14,17,25]. We used "greatest resource-consuming disease," "main disease," and "trigger-for-hospitalization disease." Other categories were not used because they were for comorbidities. Discharge information was recorded as "death," "cure," "termination," or "other." We validated nine causes of death (cancer, heart disease, cerebrovascular disease, pneumonia, chronic obstructive pulmonary disease, renal disease, dementia, old age, and infection). The 10th revision of the International Classification of Diseases (ICD-10) codes corresponding to each cause of death are listed in S2 Table. In this study, claims data issued in the month in which the discharge information was "death" were analyzed. This method has been used in previous studies on the NDB [27,28].
Claims data issued more than 2 months after the patient's death were excluded because these claims probably indicated that the patient could be alive and that the claims data were issued due to errors such as miscoding discharge/disease status when the claims were issued by the medical institutions [20]. We combined disease categories in the medical or DPC data and the month in which the claims data were issued and created five definitions, which were validated for each cause of death (Table 1). For example, in one case, cause of death was cancer because the discharge information in the medical data in January 2018 was "death" and the "Main disease" was cancer. In a second case, cause of death was cancer and cerebrovascular disease because the discharge information in DPC data in July 2019 was "death," the "Main disease" was cancer, and the "Greatest resource-consuming disease" was cerebrovascular disease.

Table 1. Definitions of causes of death.
Definition   Pattern
1   "Disease" in medical data + claims data issued in the month in which discharge information was "death"
2   "Main disease" in medical data + claims data issued in the month in which discharge information was "death"
3   "Greatest resource-consuming disease" in DPC data + claims data issued in the month in which discharge information was "death"
4   "Main disease" in DPC data + claims data issued in the month in which discharge information was "death"
5   "Trigger-for-hospitalization disease" in DPC data + claims data issued in the month in which discharge information was "death"
Abbreviations: DPC, Diagnosis Procedure Combination.
https://doi.org/10.1371/journal.pone.0283209.t001
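The five definitions in Table 1 can be sketched as a lookup from definition number to a dataset and a disease category. This is a minimal illustration, not the authors' implementation: the record layout and key names are hypothetical, and the disease values are assumed to have already been mapped from ICD-10 codes to the nine cause-of-death groups (S2 Table).

```python
# Illustrative sketch; key names and the record layout are hypothetical.
# Disease values are assumed to be pre-mapped from ICD-10 codes to cause groups.
DEFINITIONS = {
    1: ("medical", "disease"),
    2: ("medical", "main_disease"),
    3: ("DPC", "greatest_resource_consuming_disease"),
    4: ("DPC", "main_disease"),
    5: ("DPC", "trigger_for_hospitalization_disease"),
}


def causes_by_definition(claim_rows, definition):
    """Return the causes of death a patient receives under one definition.

    Only claims issued in a month whose discharge information is "death"
    are considered, as in Table 1.
    """
    data_type, category = DEFINITIONS[definition]
    causes = set()
    for row in claim_rows:
        if row["data_type"] == data_type and row["discharge"] == "death":
            causes.update(row["diseases"].get(category, []))
    return causes


# The second example from the text: DPC data, July 2019, discharge "death".
second_case = [
    {
        "data_type": "DPC",
        "claim_month": "2019-07",
        "discharge": "death",
        "diseases": {
            "main_disease": ["cancer"],
            "greatest_resource_consuming_disease": ["cerebrovascular disease"],
        },
    }
]

print(causes_by_definition(second_case, 4))
print(causes_by_definition(second_case, 3))
print(causes_by_definition(second_case, 2))
```

Under this sketch the second case is cancer by Definition 4, cerebrovascular disease by Definition 3, and receives no cause under the medical-data definitions because it has no medical claim with discharge "death".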

Gold standard
In this study, we used the death certificate as the gold standard for the cause of death. In Japanese cause-of-death statistics, the cause of death is identified based on the information in the death certificate. Two researchers (ST and MM) independently identified the cause of death from information in the death certificates according to guidelines published by the Ministry of Health, Labour, and Welfare in Japan [29,30]. If the two researchers agreed on the cause of death, it was considered the cause of death. Disagreements were resolved by consensus through discussion.

Defining true positives, false negatives, false positives, and true negatives
We defined four indices based on previous studies [15,21]. True positives were defined as cases with any claims-based definition of cause of death (i.e., cause of death obtained from claims) and gold standard definitions of cause of death (i.e., cause of death identified from the death certificate). False negatives were defined as cases with no claims-based definition of cause of death but with a gold standard definition of cause of death. False positives were defined as cases with any claims-based definition of cause of death but no gold standard definitions of cause of death. True negatives were defined as cases with no claims-based definition of cause of death and no gold standard definition of cause of death.
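The four indices can be written as a small classification rule applied per cause of death. The sketch below is illustrative and assumes each patient has a set of claims-based causes under a given definition and a single gold standard cause from the death certificate; the example cases are hypothetical.

```python
from collections import Counter


def classify(claims_positive, gold_positive):
    """Map one patient to TP, FP, FN, or TN for a given cause of death."""
    if claims_positive:
        return "TP" if gold_positive else "FP"
    return "FN" if gold_positive else "TN"


def tabulate(cases, cause):
    """Count the four indices for one cause.

    `cases` is a list of (claims_causes, gold_cause) pairs, where
    claims_causes is the set of causes assigned by a claims-based definition
    and gold_cause is the cause identified from the death certificate.
    """
    counts = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for claims_causes, gold_cause in cases:
        counts[classify(cause in claims_causes, gold_cause == cause)] += 1
    return dict(counts)


# Hypothetical patients, for illustration only.
cases = [
    ({"cancer"}, "cancer"),         # both sources say cancer
    ({"cancer"}, "heart disease"),  # claims say cancer, certificate does not
    (set(), "cancer"),              # no claims-based cause, certificate says cancer
    ({"pneumonia"}, "pneumonia"),   # neither source says cancer
]
print(tabulate(cases, "cancer"))
```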

Statistical analysis
Data related to patient characteristics are presented using standard descriptive statistics of median (interquartile range [IQR]) for continuous variables and number (%) for categorical variables by patients with or without discharge "death" on claims data. We calculated the sensitivity, specificity, PPV, and 95% confidence interval (CI) of each definition. We obtained 95% CIs of these diagnostic indices using the senspec option of PROC FREQ. We listed true positives, false positives, true negatives, and false negatives. Previous studies have emphasized PPVs and sensitivities; we discuss these two measures in this study [31,32]. S3 Table shows the numbers of true positives, false positives, false negatives, and true negatives for each cause of death. All analyses were performed using SAS software version 9.4 (SAS Institute, Cary, NC).
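For readers without SAS, the diagnostic indices can be computed from the four counts as sketched below. The confidence interval here is the Wilson score interval, which is an assumption made for illustration; it is not necessarily the interval type reported by PROC FREQ in the study, and the counts are hypothetical.

```python
import math


def wilson_ci(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if total == 0:
        return (math.nan, math.nan)
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (centre - half, centre + half)


def proportion_with_ci(successes, total):
    """Point estimate and Wilson interval; NaN when the denominator is zero."""
    estimate = successes / total if total else math.nan
    return estimate, wilson_ci(successes, total)


def diagnostic_indices(tp, fp, fn, tn):
    """Sensitivity, specificity, and PPV from the four counts."""
    return {
        "sensitivity": proportion_with_ci(tp, tp + fn),
        "specificity": proportion_with_ci(tn, tn + fp),
        "ppv": proportion_with_ci(tp, tp + fp),
    }


# Hypothetical counts, not taken from S3 Table.
indices = diagnostic_indices(tp=90, fp=10, fn=10, tn=90)
for name, (estimate, (low, high)) in indices.items():
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f}-{high:.3f})")
```

Unlike sensitivity and specificity, PPV depends on how common the cause of death is in the study population, which is one reason estimates for rare causes are unstable.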

Patients' characteristics
The eligibility flow is illustrated in Fig 1. After the three exclusion criteria were applied, the final number of patients was 1706 (93.4%). The median (IQR) age of the patients was 71.0 (61.0-79.0) years. The number of patients with both medical and DPC data was 1283 (75.2%). Only 228 patients (13.3%) had medical data only and 195 (11.4%) had DPC data only. Altogether, 1511 patients had medical data and 1478 patients had DPC data. A total of 81.3% (1387/1706 patients) were discharged with "death," with 30.4% (460/1511 patients) in the medical data and 62.7% (927/1478 patients) in the DPC data. Regarding the characteristics of patients with/without the death information in claims data, patients without the information were more likely to have cancer and die in palliative care units than patients with the information (Table 2).
The Kappa coefficient was 0.82 when the two researchers identified the cause of death from the death certificates. Based on the gold standard cause of death from the death certificate, the most common cause of death was cancer (66.4%), followed by heart disease (6.9%), cerebrovascular disease (4.3%), and pneumonia (2.6%) (Table 2).
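Agreement between two raters is commonly summarized with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below assumes the reported coefficient is Cohen's kappa for two raters (the text does not name the variant), and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one category per case."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of cases where both raters chose the same cause.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal proportions per category.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings from two researchers, for illustration only.
rater_1 = ["cancer", "cancer", "heart disease", "pneumonia"]
rater_2 = ["cancer", "cancer", "heart disease", "heart disease"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

Here the observed agreement is 3/4 and the chance agreement is 6/16, giving a kappa of 0.6.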

Discussion
To the best of our knowledge, this multicenter cross-sectional study is the first to develop and validate definitions to identify the cause of death in Japanese claims data. Although several studies have evaluated the validity of information such as diagnosis [14][15][16][17], procedure [14,18], prescription [19], and discharge information [20,21] in Japanese claims data, there has been limited information regarding the method used to identify the cause of death. The major finding of this study was that the cause of death due to cancer and cerebrovascular disease in Japanese claims data could be satisfactorily identified using definitions for which PPVs were 89%-96% for cancer and 71%-90% for cerebrovascular disease. Contrary to a previous study that evaluated sudden cardiac death [33], our results for heart disease showed low PPVs. This was because the participants in the previous study were patients with opioid prescriptions whose deaths were validated based on chart review and death certificates. Additionally, our results showed that the sensitivity and PPV for the "Main disease" in the DPC data tended to be higher than those for the "Greatest resource-consuming disease" in the DPC data. The cause of death on the death certificate was likely to match the "Main disease" in the DPC data, which is a condition assigned by physicians. In contrast, it might show a different trend from the "Greatest resource-consuming disease" in the DPC data, which is the condition responsible for the greatest use of medical resources and is assigned by physicians based on several characteristics, such as diagnoses, comorbidities, and complications [17]. Therefore, based on our results, we make two recommendations for future studies examining end-of-life care using the Japanese claims database. First, using medical data, it might be useful to use Definition 2 ("Main disease" in medical data + claims data issued in the month in which discharge information was "death"): for cancer, heart

An interesting finding of this study was that approximately 20% of patients lacked discharge information of "death" in the claims data; we also reported the characteristics of these patients. Compared to previous studies [20,21], we showed a different proportion of missing death information among inpatients. Sakai et al. [21] reported that the sensitivity of death information was 94.4% for inpatients and 47.4% for outpatients. Medical institutions may omit recording of patient deaths in the claims data due to lack of motivation, as these data are primarily collected for reimbursement purposes. Patients without death information in the claims data were more likely to have cancer and die in palliative care units than patients with death information in the claims data. Additionally, there were no palliative care units at hospital B. Thus, the recording of death information might vary among departments and hospitals. A previous study showed similar findings regarding the difference in results of the diagnosis of comorbidities between four hospitals [14]. However, our study corroborates the proportion of missing death information and the characteristics of patients with those shown previously, which we believe will be useful for future database studies.
The findings of this study can be applied to the NDB. Several studies have assessed end-oflife care using the NDB in Japan [27,28,34,35]. Even though the accuracy of the method used in previous studies to identify the cause of death is important for the identification of patients with diseases of interest in end-of-life care, the method has not been validated. Therefore, the results of this study can contribute to a more accurate identification of terminal patients in future studies on end-of-life care using the NDB.

Limitations
This study had several limitations. First, the sample size may have been insufficient. Although we used death certificates for 2 years and claims data from two institutions, more than half of the deaths were from cancer, and the sample size for other diseases was small. Therefore, it is necessary to increase the sample size for each cause of death in future studies.
Second, a clear definition cannot be recommended for patients with both medical and DPC data. The results of this study can be applied to patients with medical or DPC data. However, PPVs for Definitions 2 and 4 do not differ significantly; therefore, it is possible to identify patients using either definition, although a new definition needs to be developed. Third, there might be measurement errors in the information on death certificates as the gold standard. For example, Mieno et al. [29] evaluated causes of death reported on death certificates against a reference standard of pathologist assessment based on autopsy data and clinical records, and found that the concordance rate was relatively high for cancer (81%) but low for heart disease (55%) and pneumonia (9%). Therefore, the cause of death identified in this study might not be the true cause due to measurement error. However, it was difficult to obtain pathological autopsy results as the gold standard for the sample size of this study. Therefore, in this study, we considered death certificates as a plausible gold standard.
Finally, the variables used in this study were insufficient. In this study, only the disease categories and discharge information were used. Japanese claims data include other information such as treatment details and hospital charges. Sato et al. [15] identified breast cancer with a high probability in Japanese claims data by combining breast cancer diagnosis and treatment procedure codes. In the future, it will be necessary to search for variables that improve the accuracy of each cause of death and to conduct analyses that include these variables.

Conclusions
This study validated a method for identifying the cause of death using Japanese claims data with sufficient accuracy. The results showed that for cancer, heart disease, and cerebrovascular disease, Definition 2 in the medical data and Definition 4 in the DPC data exhibited high PPV and sensitivity. However, for heart disease, PPV was below 70% for all definitions. In addition, we could not make a clear argument for other causes of death because of the small sample size. Therefore, the definitions of cause of death using the claims data identified in this study can be used with confidence for cancer and cerebrovascular disease but should be used with caution for other causes of death.